This tutorial is based on two sources:
https://hbctraining.github.io/Training-modules/IntroR/ by Meeta Mistry, Mary Piper, and Radhika Khetani
https://rspatial.org/raster/sdm/index.html by Robert J. Hijmans and Jane Elith.
This file should have opened in a web browser window. It doesn’t run anything in R by itself; instead you will need to copy and paste (or retype) commands from it.
Whenever you see something like this,
print("hello world") # comments look like this; you don't have to copy them
## [1] "hello world"
the tutorial will display two code boxes.
The box without the two hash characters (##) contains the command, which is the text that you will run in R. To run something in R, simply copy the text in the gray codebox into your console and press Return.
The box with the ## is the result, which should correspond to what will be printed in your console window.
Find the file “installscript.R” in the class network directory where you found this tutorial, and open it in RStudio. RStudio is a development environment for R, which means it provides a graphical interface for writing code in the R programming language.
The RStudio interface has four main panels:
Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio. It’s the green box in the image below.
Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console. Right now it’s showing “installscript.R”. It’s the red box in the image below.
Environment/History: Environment shows all active objects, and History keeps track of all commands run in the console. It’s the blue box in the image below.
Files/Plots/Packages/Help: several different tabs that show the active directory, plots, installed packages (more about this later), help files, etc. It’s the yellow box in the image below.
First you’re going to use the script editor (top left panel). Click “Source” in the top right of this panel to run the “installscript.R” script, which will install all the libraries and custom functions you’ll need for this course. While it’s running, look at the other panels too.
Next find the console (the bottom left panel). When you type something into this command-line interface and hit Enter, the text you entered will be run in the R processor and the results will be returned. Right now you’ll see a bunch of text scrolling by as it installs the packages.
Interpreting the command prompt can help you understand when R is ready to accept commands. The list below describes the different states of the command prompt and how you can exit a command:
If R is ready to accept commands, the R console shows a > prompt. Can you find this on your own screen?
When the console receives a command (by directly typing into the console or running from the script editor with Ctrl-Enter), R will try to execute it. After running, the console will show the results and come back with a new > prompt to wait for new commands.
If R is still waiting for you to enter more data because the code sent to the console isn’t a complete command yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. Often this can be due to you having not ‘closed’ a parenthesis or quotation.
If you can’t figure out why your command isn’t running, you can click inside the console window and press the Escape key to escape the command and bring back a new prompt >; then you can start over sending the command.
Once your libraries are finished installing and it shows a command prompt, run the following command by typing or pasting it into the console and hitting Enter:
getwd()
## [1] "/Users/jblois/Documents/GitHub/biodata_shortcourse/development"
This shows your current working directory, that is, where you are in the computer’s file structure. (It will NOT look like the result above.) If you look in the “Files” tab in the bottom right panel, you will see all the files in this directory, which you can also list using the following command:
dir()
## [1] "assemble-paleoclimate.R" "biodata_BobcatSTEM.Rproj"
## [3] "Blois_Day1.RData" "climate"
## [5] "climatelayers.RData" "climatemodel.RData"
## [7] "course_overview.html" "course_overview.Rmd"
## [9] "data-cleaning.R" "day1_tutorial.html"
## [11] "day1_tutorial.Rmd" "day2_tutorial.html"
## [13] "day2_tutorial.Rmd" "day3_tutorial_old.Rmd"
## [15] "day3_tutorial.html" "day3_tutorial.Rmd"
## [17] "fix-paleoclimate.R" "gbif-download.R"
## [19] "gbif.RData" "images"
## [21] "installscript.R" "Lecture slides_updated.pptx"
## [23] "neotoma_lonlat.RData" "neotoma-download.R"
## [25] "neotoma-raw.RData" "paleoclimate"
## [27] "species-range.R" "temp"
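As a small sketch of some related base R helpers (the ".R" pattern and the file name checked here are only illustrations), list.files() does the same job as dir() and can filter the listing by file name:

```r
# Show only the last part of the working directory path
basename(getwd())

# List only the files whose names end in ".R" (dir() accepts the same argument)
r_scripts <- list.files(pattern = "\\.R$")
r_scripts

# Check whether a particular file is present in the working directory
has_install_script <- file.exists("installscript.R")
has_install_script
```

What you see will depend on the contents of your own working directory.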
Try some other stuff to see how this works.
9+6 #you can just use it as a calculator
## [1] 15
sum(9,6) #you can also use functions instead of arithmetic symbols. In this case, the word "sum" indicates the function, which is acting on the values within the parentheses.
## [1] 15
Now, try something deliberately wrong. Copy and paste this line of code into your console, then press Enter:
9+6+
If you look at your console, you will see that instead of an answer
(15), you see the + prompt underneath a line of code that says
9+6+. To complete the expression, type a 0 after
the + within the console and press Enter. You have now ‘closed’ the line of code and
gotten your answer.
Remember, you can always click inside the console window and press the Escape key to escape the mistake and bring back a new prompt >; then you can start over sending the command.
Now try the script editor (top left window in RStudio). Open up a new script by navigating to File –> New File –> R script. Once you have a blank script open, paste in the following:
# I am adding 3 and 5!
3 + 5
Nothing ran, because you wrote it in the script and not in the
console. Highlight the pasted text within your script and hit
Ctrl+Enter (or click Run in the top right corner of the
pane): the highlighted text will be sent to the console and your result
will appear.
This is useful for when you need to run the same command multiple times, such as when you’re trying to get something right – that’s why it’s called the “editor”. You should make a habit of writing your commands in the code editor instead of the console, because then you can easily go back to your script later to see exactly how you did it.
Notice that the statement “I am adding 3 and 5!” in your script
started with the comment symbol, #. What happens
if we do that same command without the #? Re-run the
command after removing the # sign in the front:
I am adding 3 and 5!
Now R is trying to run that sentence as a command, and it doesn’t work.
We get an error in the console: “Error: unexpected symbol in "I am"”.
This means that the R interpreter did not know what to do with that
command. Things sent to the console won’t work unless they are properly
constructed commands in the R language.
Use the # character to insert comments about what your code is doing. This, again, makes it easier to understand your own work later.
To do useful and interesting things in R, we need to assign
values to variables using the assignment
operator, <-. For example, we can use the
assignment operator to assign the value of 3 to a variable
named x by running:
x <- 3
The assignment operator (<-) assigns values
on the right to variables on the left.
A variable in computer programming is a symbolic name for a location where information can be maintained and referenced. You can think of a variable like a “bucket” of information with a label on the outside. When referring to the bucket of information, we use the label on the bucket (the variable name), not the data stored in the bucket (the value).
In the example above, we created a variable or a ‘bucket’ called
x. Inside we put a value, 3.
Let’s create another variable called y and give it a
value of 5.
y <- 5
When assigning a value to a variable, R does not print anything to the console. You can tell it to print the value by typing the variable name:
y
## [1] 5
You can also view information on all the currently stored variables
by looking in your Environment window in the upper
right-hand corner of the RStudio interface.
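The Environment tab has console equivalents. A minimal sketch, using a throwaway variable named scratch that is only an illustration and is not used anywhere else in this tutorial:

```r
scratch <- 10        # a throwaway variable, used only for this illustration

ls()                 # names of all variables in the current environment
exists("scratch")    # TRUE

rm(scratch)          # remove a variable you no longer need
exists("scratch")    # FALSE
```

After rm(), the variable also disappears from the Environment tab.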
Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation?
x+y
## [1] 8
Try assigning the results of this operation to another variable
called result.
result <- x + y
result
## [1] 8
Try changing the value of x to 5 using the assignment operator. What happens to result? Does it change?
Now try changing the value of y to contain the value 10. What do you need to do to update the variable result to the new value of x + y? Show your results to an instructor.
Variables can be given almost any name, such as x, current_temperature, or subjectID. However, there are some rules / suggestions you should keep in mind:
Names are case sensitive (X is different from x)
Names cannot start with a number (2x is not valid but x2 is)
Names cannot be words reserved by R (e.g., if, else, for). In general, even if it’s allowed, it’s best to not use names that R already uses (e.g., c, T, mean, data) as variable names. You can type ? followed by the name to see if the name is already in use by a built-in function.
Variables can contain values of specific types within R. The most common basic data types in R include:
"numeric" for any numerical value
"character" for text values, denoted by using quotes ("") around the value
"logical" for TRUE and FALSE (the boolean data type)
The table below provides examples of each of the commonly used data types:
| Data Type | Examples |
|---|---|
| Numeric: | 1, 1.5, 20, pi |
| Character: | “anytext”, “5”, “TRUE” |
| Logical: | TRUE, FALSE, T, F |
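To check which data type a value has, base R provides class(). A short sketch using values from the table above:

```r
class(1.5)         # "numeric"
class("anytext")   # "character"
class("5")         # "character": the quotes make it text, not a number
class(TRUE)        # "logical"

# is.*() functions ask a yes/no question about the type
is.numeric(pi)          # TRUE
is.character("TRUE")    # TRUE
is.logical("TRUE")      # FALSE
```

Note that "5" and "TRUE" in quotes are text, so arithmetic or logical operations on them will not behave like they do for 5 and TRUE.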
We know that variables are like buckets, and so far we have seen that
bucket filled with a single value. Even when
result was created, the result of the mathematical operation was a single value. **Variables can store more than just a single value; they can store a multitude of different data structures.** These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).
A vector is the most common and basic data structure in R, and is
pretty much the workhorse of R. It can be constructed with the
combine command, c(). It’s basically just a
collection of values, mainly either numbers,
c(1, 40, 9, 22)
## [1] 1 40 9 22
or characters,
c("a", "b", "c", "q")
## [1] "a" "b" "c" "q"
or logical values.
c(TRUE, TRUE, FALSE, TRUE)
## [1] TRUE TRUE FALSE TRUE
Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type. For example, if you were to try to create the following vector:
c("a", 9, 12, TRUE)
## [1] "a" "9" "12" "TRUE"
R will turn it into the following by forcing (“coercing”) all the
values to character type:
[1] "a" "9" "12" "TRUE"
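Coercion can be checked with class(). The sketch below (the variable names mixed and num_and_lgl are only illustrations) also shows that, when no text is involved, logical values are coerced to numbers:

```r
mixed <- c("a", 9, 12, TRUE)
class(mixed)                # "character"

# Without any text values, logicals become numbers (TRUE = 1, FALSE = 0)
num_and_lgl <- c(9, 12, TRUE, FALSE)
num_and_lgl                 # 9 12 1 0
class(num_and_lgl)          # "numeric"

# Converting text back to numbers works when the text really is a number
as.numeric(c("9", "12"))    # 9 12
```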
The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements. Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single bucket.
Let’s create a vector of specimen counts and assign it to a variable
called specCounts. Run the following lines:
specCounts <- c(3000, 50000, 46)
specCounts
## [1] 3000 50000 46
Each element of this vector contains a single numeric value, and
the three values are combined into a vector using
c() (the combine function). All of the values are
put within the parentheses and separated by commas.
Looking in your Environment tab, you can see that the
specCounts variable you just created is numeric, starts at
element 1 and ends at element 3 (i.e. it’s a vector containing 3 numeric
values).
A vector can also contain characters. Run the following code to
create another vector called species with three elements,
where each element corresponds with the previous vector.
species <- c("crocodile", "trout", "panda")
species
## [1] "crocodile" "trout" "panda"
A matrix in R is a collection of vectors of the
same length and type. Vectors can be combined as
columns in the matrix or by row, to create a 2-dimensional
structure.
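A minimal sketch of both ways of building a matrix, using base R's matrix(), cbind() and rbind(); the variable names are only illustrations:

```r
# Fill a 2 x 3 matrix column by column (the default)
m_by_col <- matrix(1:6, nrow = 2)
m_by_col

# Fill the same shape row by row instead
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_by_row

# Combine two existing vectors of the same length as columns or as rows
a <- c(1, 2, 3)
b <- c(4, 5, 6)
cbind(a, b)    # 3 rows, 2 columns
rbind(a, b)    # 2 rows, 3 columns

dim(m_by_col)  # 2 3 (rows, then columns)
```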
Matrices are used commonly as part of the mathematical machinery of
statistics. We don’t create these manually very often, but they’re very
commonly used inside R functions. They are usually of numeric data type.
As with vectors, every value in a matrix must be of the same type: if
the input data are of mixed types (numeric, character, etc.), the
matrix() function does not stop with an error but coerces all values to
a single common type, so a matrix that was meant to be numeric can
quietly become character and break later calculations.
A data.frame is the most common data structure in R for
storing data in tables, and it’s what we use for statistics and
plotting. A data.frame is similar to a matrix in that it’s
a collection of vectors of the same length and each
vector represents a column. However, in a dataframe each vector
can be of a different data type (e.g., characters, integers,
factors).
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.
We can create a dataframe by bringing vectors
together to form the columns. We do this using the
data.frame() function. We give the function the different
vectors we would like to bind together, and it creates the data frame.
This function will only work for vectors of the same
length.
df <- data.frame(species,specCounts)
df
## species specCounts
## 1 crocodile 3000
## 2 trout 50000
## 3 panda 46
You can see that there are two columns, each one containing one of the input vectors.
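The sketch below shows two things: column names can be set when the data frame is created (the names animal and count are only illustrations), and vectors of mismatched length are rejected. The error is caught with tryCatch() so that the example keeps running:

```r
species <- c("crocodile", "trout", "panda")
specCounts <- c(3000, 50000, 46)

# Column names can be given explicitly as name = vector
df_named <- data.frame(animal = species, count = specCounts)
names(df_named)   # "animal" "count"
nrow(df_named)    # 3

# Vectors of mismatched length (3 and 2 here) cannot be combined
attempt <- tryCatch(
  data.frame(species, c(1, 2)),
  error = function(e) "error: the vectors have different lengths"
)
attempt
```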
Lists are a data structure in R that can be perhaps a bit daunting at first, but soon become amazingly useful. A list is a data structure that can hold any number of any types of other data structures, one after another.
If you have variables of different data structures you wish to
combine, you can put all of those into one list object by using the
list() function and placing all the items you wish to
combine within parentheses.
Run the following to construct a list called “list1” that contains several of the variables we’ve created so far in this tutorial.
list1 <- list(result, species, specCounts)
list1
## [[1]]
## [1] 8
##
## [[2]]
## [1] "crocodile" "trout" "panda"
##
## [[3]]
## [1] 3000 50000 46
There are three components corresponding to the three different variables we passed in, and what you see is that the structure of each is retained.
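To get a component back out of a list, use double square brackets. The sketch below also builds a second list with named components; list2 and its component names are only illustrations:

```r
result <- 8
species <- c("crocodile", "trout", "panda")
specCounts <- c(3000, 50000, 46)
list1 <- list(result, species, specCounts)

# Double square brackets pull out one component in its original form
list1[[2]]       # the species vector
list1[[2]][1]    # "crocodile"

# Components can also be named when the list is built
list2 <- list(total = result, animals = species, counts = specCounts)
list2$counts     # the specCounts vector
length(list2)    # 3
```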
A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
The general usage for a function is the name of the function followed by parentheses:
function_name(input)
The input(s) are called arguments, which can include the data that the function works on and options that change how the function behaves.
Not all functions take arguments, for example:
getwd()
However, most functions take one or more arguments. If you don’t specify a required argument when calling the function, you will receive an error. Other arguments are optional: if you don’t include them, the function will fall back on using a default. The defaults represent standard values that the author of the function specified as being “good enough in standard cases”, but if you want something specific, simply change the argument to the value of your choice.
We have already used a few examples of basic functions in the
previous lessons, e.g., getwd(), c(), and
data.frame(). These functions are available as part of R’s
built-in capabilities, and we will explore a few more of these base
functions below.
You can also get functions from external packages or libraries, or even write your own.
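As a rough sketch of what writing your own function looks like, here is a hypothetical function, as_proportion(), which is not part of R or of this course's scripts. It has one required argument (counts) and one optional argument (digits) with a default value:

```r
# Hypothetical function: convert counts to proportions of the total,
# rounded to `digits` decimal places (the default is 2)
as_proportion <- function(counts, digits = 2) {
  round(counts / sum(counts), digits = digits)
}

specCounts <- c(3000, 50000, 46)
as_proportion(specCounts)               # uses the default, digits = 2
as_proportion(specCounts, digits = 4)   # overrides the default
```

Leaving out counts would give an error because it has no default, while leaving out digits is fine.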
Let’s revisit the function c() that we have used
previously to combine data into vectors. The arguments it takes
are a collection of numbers, characters or strings (separated by a
comma). The c() function performs the task of combining all
the numbers or characters provided as arguments into a single vector.
You can also pass an existing vector as one of the arguments in order to
add elements to it:
specCountsLonger <- c(900,specCounts) #adds the new value at the beginning
#or
specCountsLonger <- c(specCounts,900) #adds the new value at the end
What happens here is that we take the original vector
specCounts (containing three elements), and add another
item to one end. You can imagine doing this over and over again to build
a vector.
Since R is used for statistical computing, many of the base functions
involve mathematical operations. If interested, we have linked a detailed
guide for performing basic statistical tests in R. One example of a
base R mathematical function would be sqrt(). The
input/argument must be a number, and the output is the square root
of that number. Let’s try finding the square root of 81:
sqrt(81)
## [1] 9
Now what would happen if we called the function (e.g. ran the function), on a vector of values instead of a single value?
sqrt(specCounts)
## [1] 54.77226 223.60680 6.78233
In this case the function was called on each individual value of the
vector specCounts and the respective results were
displayed. Beware: this does not work with every function!
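To illustrate the difference, here is a short sketch contrasting functions that work element by element with functions that reduce the whole vector to a single value:

```r
specCounts <- c(3000, 50000, 46)

# Element-wise: one result per element
sqrt(specCounts)
specCounts * 2        # each count doubled

# Summarizing: one result for the whole vector
sum(specCounts)       # 53046
max(specCounts)       # 50000
length(specCounts)    # 3
```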
Let’s try another function, this time one with options (arguments
that change the behavior of the function) that we can change,
for example round:
round(3.14159)
## [1] 3
We can see that we get 3. That’s because the default is
to round to the nearest whole number. What if we want a
different number of decimal places? How would we change the
default?
The best way of finding out this information is to use the help
operator ? followed by the name of the function. Doing this
will open up the help manual in the bottom right panel of RStudio that
will provide a description of the function, usage, arguments, details,
and examples:
?round
If you scroll through the help file for the function, you will see a
lot of details - different but related functions (e.g.,
ceiling); Usage examples (here it lists the default values
as well); detail on the input / arguments; lots more details; and
Examples.
You can also use the example() function to run the
examples from the help file. (This one has a lot of examples!)
example(round)
##
## round> round(.5 + -2:4) # IEEE / IEC rounding: -2 0 0 2 2 4 4
## [1] -2 0 0 2 2 4 4
##
## round> ## (this is *good* behaviour -- do *NOT* report it as bug !)
## round>
## round> ( x1 <- seq(-2, 4, by = .5) )
## [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
##
## round> round(x1) #-- IEEE / IEC rounding !
## [1] -2 -2 -1 0 0 0 1 2 2 2 3 4 4
##
## round> x1[trunc(x1) != floor(x1)]
## [1] -1.5 -0.5
##
## round> x1[round(x1) != floor(x1 + .5)]
## [1] -1.5 0.5 2.5
##
## round> (non.int <- ceiling(x1) != floor(x1))
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [13] FALSE
##
## round> x2 <- pi * 100^(-1:3)
##
## round> round(x2, 3)
## [1] 0.031 3.142 314.159 31415.927 3141592.654
##
## round> signif(x2, 3)
## [1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06
If you are already familiar with the function but just need to remind yourself of the names of the arguments, you can use:
str(round)
## function (x, digits = 0, ...)
This tells us that we can change the number of digits returned by
adding an optional argument. We can type
digits = 2 or however many we may want:
round(3.14159, digits = 2)
## [1] 3.14
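A few variations, as a sketch: arguments can be matched by position instead of by name, and signif() (which appears in the round help file) counts significant digits rather than decimal places:

```r
# Arguments can be matched by position instead of by name
round(3.14159, 2)               # 3.14

# round() counts decimal places; signif() counts significant digits
round(1234.5678, digits = 2)    # 1234.57
signif(1234.5678, digits = 2)   # 1200

# A negative digits value rounds to the left of the decimal point
round(1234.5678, digits = -2)   # 1200
```

Naming the argument is slightly longer to type but makes the code easier to read later.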
Another commonly used base function is mean(). Use this
function to calculate an average for the specCounts vector,
and show your result to the instructor. (If you look at the help file,
you will see that the arguments for the mean() function are
supplied in a different data structure than the other functions
we’ve seen so far.)
The last thing we’re going to cover in this introduction is how to inspect data.
When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let’s begin with vectors and how to access different elements, and then extend those concepts to dataframes.
If we want to extract one or several values from a vector, we must
provide one or several indexes using square brackets [ ]
syntax. The index represents the location of the element within
a vector (or the compartment number, if you think of the bucket
analogy). R indexes start at 1.
Let’s start by creating a vector called age:
age <- c(15, 22, 45, 52, 73, 81)
If we only wanted the second value of this vector, we would use the following syntax:
age[2]
## [1] 22
If we wanted all values except the second value of this vector, we would use the following:
age[-2]
## [1] 15 45 52 73 81
If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:
idx <- c(3,5,6) # create vector of the elements of interest
age[idx]
## [1] 45 73 81
To select a sequence of continuous values from a vector, we would use
: which is a special operator that creates numeric vectors
of integers in increasing or decreasing order. Let’s select the
first four values from age:
age[1:4]
## [1] 15 22 45 52
Practice: Try reversing that to say 4:1 and see what
happens!
Selection of values can also be performed using logical expressions. Logical operators include greater than (>), less than (<), and equal to (==). We can use logical expressions to determine whether a particular condition is true or false, and then keep only the elements for which it is TRUE:
age[age > 50]
## [1] 52 73 81
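To see what is happening inside the square brackets, it can help to run the comparison by itself. A short sketch, which also combines two conditions with & (and) and | (or):

```r
age <- c(15, 22, 45, 52, 73, 81)

# The comparison itself produces a logical vector, one value per element
age > 50                    # FALSE FALSE FALSE TRUE TRUE TRUE

# which() reports the positions of the TRUE values
which(age > 50)             # 4 5 6

# Conditions can be combined with & (and) or | (or)
age[age > 20 & age < 60]    # 22 45 52
age[age < 20 | age > 80]    # 15 81
```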
More details about using logical expressions to subset data can be found here
We’re going to use the built-in data set called iris.
This single dataframe contains the measurements in centimeters of the
variables sepal length, sepal width, petal length and petal width for 50
flowers from each of 3 species of iris, a total of 150 specimens. The
species are Iris setosa, I. versicolor, and I.
virginica.
This is a small dataframe, so you can just look at it in the console first.
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Or check how many rows and columns it has with
dim():
dim(iris)
## [1] 150 5
However, 150 lines is still a little inconvenient if you just want to see what the data in each column are generally like. Try this:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Now you see just the first 6 lines, as well as the header (column
names). Each row holds information for a single specimen, and the
columns contain information about the specimen’s measurements and
species. What data type is each column? Check using str(),
which we used before to inspect the arguments of a function. When you
call it on a variable, it tells you about the data structure and
types.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Uh-oh, what’s a factor data type? Just a character field with a restricted set of possible values called “levels”. Don’t worry about that for now.
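If you are curious, the sketch below shows a few base R functions for looking at a factor; none of them are needed for the rest of this section:

```r
levels(iris$Species)     # "setosa" "versicolor" "virginica"
nlevels(iris$Species)    # 3

# Count how many rows fall into each level
table(iris$Species)

# Convert to plain text when needed
class(as.character(iris$Species))   # "character"
```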
(You can also look at this in a separate tab in RStudio. Within the
“Environment” panel, choose “package:datasets” from the dropdown that
currently says “Global Environment”. Then click on iris in
the Environment tab to open the data table in a new tab in the same pane
as the script editor.)
Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from it we need to specify the index for each dimension. We use the same square bracket notation but rather than providing a single index, there are two indexes. Within the square bracket, row numbers come first followed by column numbers, and the two are separated by a comma; i.e., dataframe[row,column]
iris[1, 1] # element from the first row in the first column of the data frame
## [1] 5.1
iris[1, 3] # element from the first row in the 3rd column
## [1] 1.4
To select whole rows, you provide only the index for the rows and leave the columns index blank. The key here is to include the comma, to let R know that you are accessing a 2-dimensional data structure:
iris[3, ] #returns a one-row data frame containing all elements in the 3rd row
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 4.7 3.2 1.3 0.2 setosa
To select specific columns from the data frame, leave the rows index blank:
iris[ , 3] #returns a vector containing all elements in the 3rd column
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
## [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
## [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
## [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
## [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
## [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1
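Selecting a single column returns a plain vector, while selecting a row returns a data frame. A small sketch using class() to check, plus the drop = FALSE option of the square brackets, which keeps a single column as a data frame (one_col is only an illustrative name):

```r
# A single column comes back as a plain vector by default
class(iris[ , 3])                     # "numeric"

# drop = FALSE keeps the result as a one-column data frame
one_col <- iris[ , 3, drop = FALSE]
class(one_col)                        # "data.frame"
dim(one_col)                          # 150 1

# A single row is returned as a data frame
class(iris[3, ])                      # "data.frame"
```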
Just like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values:
iris[ , 1:2] #returns a dataframe containing first two columns
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5.0 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5.0 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## 11 5.4 3.7
## 12 4.8 3.4
## 13 4.8 3.0
## 14 4.3 3.0
## 15 5.8 4.0
## 16 5.7 4.4
## 17 5.4 3.9
## 18 5.1 3.5
## 19 5.7 3.8
## 20 5.1 3.8
## 21 5.4 3.4
## 22 5.1 3.7
## 23 4.6 3.6
## 24 5.1 3.3
## 25 4.8 3.4
## 26 5.0 3.0
## 27 5.0 3.4
## 28 5.2 3.5
## 29 5.2 3.4
## 30 4.7 3.2
## 31 4.8 3.1
## 32 5.4 3.4
## 33 5.2 4.1
## 34 5.5 4.2
## 35 4.9 3.1
## 36 5.0 3.2
## 37 5.5 3.5
## 38 4.9 3.6
## 39 4.4 3.0
## 40 5.1 3.4
## 41 5.0 3.5
## 42 4.5 2.3
## 43 4.4 3.2
## 44 5.0 3.5
## 45 5.1 3.8
## 46 4.8 3.0
## 47 5.1 3.8
## 48 4.6 3.2
## 49 5.3 3.7
## 50 5.0 3.3
## 51 7.0 3.2
## 52 6.4 3.2
## 53 6.9 3.1
## 54 5.5 2.3
## 55 6.5 2.8
## 56 5.7 2.8
## 57 6.3 3.3
## 58 4.9 2.4
## 59 6.6 2.9
## 60 5.2 2.7
## 61 5.0 2.0
## 62 5.9 3.0
## 63 6.0 2.2
## 64 6.1 2.9
## 65 5.6 2.9
## 66 6.7 3.1
## 67 5.6 3.0
## 68 5.8 2.7
## 69 6.2 2.2
## 70 5.6 2.5
## 71 5.9 3.2
## 72 6.1 2.8
## 73 6.3 2.5
## 74 6.1 2.8
## 75 6.4 2.9
## 76 6.6 3.0
## 77 6.8 2.8
## 78 6.7 3.0
## 79 6.0 2.9
## 80 5.7 2.6
## 81 5.5 2.4
## 82 5.5 2.4
## 83 5.8 2.7
## 84 6.0 2.7
## 85 5.4 3.0
## 86 6.0 3.4
## 87 6.7 3.1
## 88 6.3 2.3
## 89 5.6 3.0
## 90 5.5 2.5
## 91 5.5 2.6
## 92 6.1 3.0
## 93 5.8 2.6
## 94 5.0 2.3
## 95 5.6 2.7
## 96 5.7 3.0
## 97 5.7 2.9
## 98 6.2 2.9
## 99 5.1 2.5
## 100 5.7 2.8
## 101 6.3 3.3
## 102 5.8 2.7
## 103 7.1 3.0
## 104 6.3 2.9
## 105 6.5 3.0
## 106 7.6 3.0
## 107 4.9 2.5
## 108 7.3 2.9
## 109 6.7 2.5
## 110 7.2 3.6
## 111 6.5 3.2
## 112 6.4 2.7
## 113 6.8 3.0
## 114 5.7 2.5
## 115 5.8 2.8
## 116 6.4 3.2
## 117 6.5 3.0
## 118 7.7 3.8
## 119 7.7 2.6
## 120 6.0 2.2
## 121 6.9 3.2
## 122 5.6 2.8
## 123 7.7 2.8
## 124 6.3 2.7
## 125 6.7 3.3
## 126 7.2 3.2
## 127 6.2 2.8
## 128 6.1 3.0
## 129 6.4 2.8
## 130 7.2 3.0
## 131 7.4 2.8
## 132 7.9 3.8
## 133 6.4 2.8
## 134 6.3 2.8
## 135 6.1 2.6
## 136 7.7 3.0
## 137 6.3 3.4
## 138 6.4 3.1
## 139 6.0 3.0
## 140 6.9 3.1
## 141 6.7 3.1
## 142 6.9 3.1
## 143 5.8 2.7
## 144 6.8 3.2
## 145 6.7 3.3
## 146 6.7 3.0
## 147 6.3 2.5
## 148 6.5 3.0
## 149 6.2 3.4
## 150 5.9 3.0
iris[c(1,3,6), ] #returns a dataframe containing first, third and sixth rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
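The row and column selections can also be combined inside one pair of square brackets. A minimal sketch using the built-in iris data (the object name small is made up for this example):

```r
# Rows 1, 3 and 6, but only the first two columns
small <- iris[c(1, 3, 6), 1:2]
small
dim(small)   # number of rows first, then number of columns
```

The part before the comma always refers to rows and the part after it to columns.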
For larger datasets, it can be tricky to remember which column number corresponds to a particular variable. In some cases, the column number for a variable can change if the script you are using adds or removes columns. It’s therefore often better to refer to a variable by its column name, which also makes your code easier to read and your intentions clearer.
iris[1:3 , "Petal.Length"] # values of the Petal.Length column from the first three rows/samples.
## [1] 1.4 1.4 1.3
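To see why names are safer than numbers, here is a small sketch using a rearranged copy of iris. The name iris_moved is made up for this example, and the rearrangement simply puts the Species column first:

```r
# A copy of iris with Species moved to the front (illustrative only)
iris_moved <- iris[, c(5, 1:4)]

iris[1:3, 3]                      # column 3 of the original is Petal.Length
iris_moved[1:3, 3]                # column 3 of the copy is now Sepal.Width
iris_moved[1:3, "Petal.Length"]   # the name still finds the right column
```

The numeric index silently returns a different variable after the columns move, while the name keeps working.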
You can also select a particular column with the $ sign and do
operations on it. The result is the entire column as a vector (here a
factor, which is why the output ends with a Levels line). For instance,
to extract all the species names from our dataset, we can use:
iris$Species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
You can use names() or colnames() to remind
yourself of the column names. Because a column selected with $ is a
vector, we can then supply index values in square brackets to select
specific values from it. For example, if we wanted the petal
widths for the first five samples in iris:
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
iris$Petal.Width[1:5]
## [1] 0.2 0.2 0.2 0.2 0.2
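Because a column selected with $ is an ordinary vector, the usual vector functions work on it directly. A short sketch, assuming only the built-in iris data (first_five_mm is a made-up name; the iris measurements are in centimeters, so multiplying by 10 gives millimeters):

```r
mean(iris$Petal.Width)                      # average over all 150 samples
range(iris$Petal.Width)                     # smallest and largest value
first_five_mm <- iris$Petal.Width[1:5] * 10 # arithmetic applies to each selected value
first_five_mm
table(iris$Species)                         # number of samples per species
```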
The $ selects a single column by name. The result is a
one-dimensional vector, so indexing it requires only one index and no
comma. To select multiple columns by name, make a vector of
strings that correspond to column names and put it in the column
position inside the square brackets:
iris[, c("Petal.Length", "Petal.Width")]
## Petal.Length Petal.Width
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## 4 1.5 0.2
## 5 1.4 0.2
## 6 1.7 0.4
## 7 1.4 0.3
## 8 1.5 0.2
## 9 1.4 0.2
## 10 1.5 0.1
## 11 1.5 0.2
## 12 1.6 0.2
## 13 1.4 0.1
## 14 1.1 0.1
## 15 1.2 0.2
## 16 1.5 0.4
## 17 1.3 0.4
## 18 1.4 0.3
## 19 1.7 0.3
## 20 1.5 0.3
## 21 1.7 0.2
## 22 1.5 0.4
## 23 1.0 0.2
## 24 1.7 0.5
## 25 1.9 0.2
## 26 1.6 0.2
## 27 1.6 0.4
## 28 1.5 0.2
## 29 1.4 0.2
## 30 1.6 0.2
## 31 1.6 0.2
## 32 1.5 0.4
## 33 1.5 0.1
## 34 1.4 0.2
## 35 1.5 0.2
## 36 1.2 0.2
## 37 1.3 0.2
## 38 1.4 0.1
## 39 1.3 0.2
## 40 1.5 0.2
## 41 1.3 0.3
## 42 1.3 0.3
## 43 1.3 0.2
## 44 1.6 0.6
## 45 1.9 0.4
## 46 1.4 0.3
## 47 1.6 0.2
## 48 1.4 0.2
## 49 1.5 0.2
## 50 1.4 0.2
## 51 4.7 1.4
## 52 4.5 1.5
## 53 4.9 1.5
## 54 4.0 1.3
## 55 4.6 1.5
## 56 4.5 1.3
## 57 4.7 1.6
## 58 3.3 1.0
## 59 4.6 1.3
## 60 3.9 1.4
## 61 3.5 1.0
## 62 4.2 1.5
## 63 4.0 1.0
## 64 4.7 1.4
## 65 3.6 1.3
## 66 4.4 1.4
## 67 4.5 1.5
## 68 4.1 1.0
## 69 4.5 1.5
## 70 3.9 1.1
## 71 4.8 1.8
## 72 4.0 1.3
## 73 4.9 1.5
## 74 4.7 1.2
## 75 4.3 1.3
## 76 4.4 1.4
## 77 4.8 1.4
## 78 5.0 1.7
## 79 4.5 1.5
## 80 3.5 1.0
## 81 3.8 1.1
## 82 3.7 1.0
## 83 3.9 1.2
## 84 5.1 1.6
## 85 4.5 1.5
## 86 4.5 1.6
## 87 4.7 1.5
## 88 4.4 1.3
## 89 4.1 1.3
## 90 4.0 1.3
## 91 4.4 1.2
## 92 4.6 1.4
## 93 4.0 1.2
## 94 3.3 1.0
## 95 4.2 1.3
## 96 4.2 1.2
## 97 4.2 1.3
## 98 4.3 1.3
## 99 3.0 1.1
## 100 4.1 1.3
## 101 6.0 2.5
## 102 5.1 1.9
## 103 5.9 2.1
## 104 5.6 1.8
## 105 5.8 2.2
## 106 6.6 2.1
## 107 4.5 1.7
## 108 6.3 1.8
## 109 5.8 1.8
## 110 6.1 2.5
## 111 5.1 2.0
## 112 5.3 1.9
## 113 5.5 2.1
## 114 5.0 2.0
## 115 5.1 2.4
## 116 5.3 2.3
## 117 5.5 1.8
## 118 6.7 2.2
## 119 6.9 2.3
## 120 5.0 1.5
## 121 5.7 2.3
## 122 4.9 2.0
## 123 6.7 2.0
## 124 4.9 1.8
## 125 5.7 2.1
## 126 6.0 1.8
## 127 4.8 1.8
## 128 4.9 1.8
## 129 5.6 2.1
## 130 5.8 1.6
## 131 6.1 1.9
## 132 6.4 2.0
## 133 5.6 2.2
## 134 5.1 1.5
## 135 5.6 1.4
## 136 6.1 2.3
## 137 5.6 2.4
## 138 5.5 1.8
## 139 4.8 1.8
## 140 5.4 2.1
## 141 5.6 2.4
## 142 5.1 2.3
## 143 5.1 1.9
## 144 5.9 2.3
## 145 5.7 2.5
## 146 5.2 2.3
## 147 5.0 1.9
## 148 5.2 2.0
## 149 5.4 2.3
## 150 5.1 1.8
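One detail that can be surprising: selecting several columns gives back a dataframe, but selecting a single column with square brackets gives back a plain vector unless you ask otherwise. A minimal sketch (petal_cols and petals are made-up names):

```r
petal_cols <- c("Petal.Length", "Petal.Width")   # vector of column names
petals <- iris[1:5, petal_cols]

class(petals)                                    # two columns: still a data frame
class(iris[1:5, "Petal.Length"])                 # one column: simplified to a vector
class(iris[1:5, "Petal.Length", drop = FALSE])   # drop = FALSE keeps the data frame
```

Storing the names in a variable such as petal_cols can be handy when the same set of columns is needed in several places.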
While there is no equivalent $ syntax to select a row by
name, you can select specific rows using the row names (in this case
just numbers).
rownames(iris)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
## [13] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
## [25] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
## [37] "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48"
## [49] "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
## [61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72"
## [73] "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96"
## [97] "97" "98" "99" "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150"
iris[c("100", "150"),]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 100 5.7 2.8 4.1 1.3 versicolor
## 150 5.9 3.0 5.1 1.8 virginica
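Row names are more useful when they are descriptive. The sketch below builds a tiny dataframe by hand; the name sites, its row names and its values are all invented for illustration:

```r
# A made-up data frame with descriptive row names
sites <- data.frame(lat = c(-22.4, 8.6, -0.9),
                    lon = c(-42.7, -83.5, -77.9),
                    row.names = c("brazil", "costa_rica", "ecuador"))

sites["ecuador", ]                       # one row, selected by name
sites[c("brazil", "ecuador"), "lat"]     # rows by name, column by name
```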
Another way of partitioning dataframes is using the
subset() function to return the rows of the dataframe for
which a logical expression is TRUE. This allows us to subset the
data in a single step. The syntax for the subset() function
is:
subset(dataframe, column_name == "value")
Any logical expression could replace the column_name == "value" part.
For example, we can look at the samples of the species setosa only:
subset(iris, Species == "setosa")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
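The logical expression can combine several conditions with & (and) or | (or), and the optional select argument of subset() limits which columns are returned. A sketch under those assumptions (wide_setosa is a made-up name, and the 0.4 cutoff is arbitrary):

```r
# Setosa samples with petals wider than 0.4, keeping only two columns
wide_setosa <- subset(iris, Species == "setosa" & Petal.Width > 0.4,
                      select = c(Petal.Width, Species))
wide_setosa

# The same rows and columns using square brackets
iris[iris$Species == "setosa" & iris$Petal.Width > 0.4,
     c("Petal.Width", "Species")]
```

Inside subset() the column names can be written without the iris$ prefix, which is what makes it shorter than the bracket version.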
Look at the results of the following commands.
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
mean(subset(iris,Species == "setosa")$Petal.Width)
## [1] 0.246
mean(subset(iris,Species == "versicolor")$Petal.Width)
## [1] 1.326
mean(subset(iris,Species == "virginica")$Petal.Width)
## [1] 2.026
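The three commands above repeat the same calculation once per species. As a possible shortcut, base R's tapply() and aggregate() can compute the mean for every level of Species in one call. This is an alternative sketch, not something the rest of the tutorial depends on (species_means is a made-up name):

```r
# Mean petal width for each species, returned as a named vector
species_means <- tapply(iris$Petal.Width, iris$Species, mean)
species_means

# The same summary returned as a data frame
aggregate(Petal.Width ~ Species, data = iris, FUN = mean)
```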
In this section, we will create a map that shows everywhere on Earth that a species has been found. Each of us can choose our own species, using the options in the file “Species List for Day 1.xlsx”. The examples will be done using the species Morpho menelaus, the blue morpho butterfly. Wherever you see its name, you’ll substitute your own species’ name.
Step 1: Choose your favorite species from the list. The function
gbif requires the scientific name of the species. So, if
you don’t know it, open up your web browser and search for the
scientific name. Example search: “blue morpho butterfly scientific
name”. Note that there are several different species, all of which have
the common name “Blue morpho”, which is one reason it’s more precise to
use the scientific names!
For this species, Morpho menelaus, Morpho is the genus and menelaus is called the specific epithet (species indicator).
Step 2: Once you’ve chosen a species, navigate to https://www.gbif.org/ within your browser. Type your species name into the search window, then click “Occurrences” along the top of the search bar. You may need to select ‘Yes’ to the question ‘Do you want to limit your search to this taxon only?’ on the top left of the results page. Examine the occurrences.
Step 3: Open a new script (File>New File>R Script). Hit Ctrl+S to save it with the name “species-range” or something like that. It will prompt you to save it in the test folder of your login, which is fine.
You will use a function in the R package dismo to
download your data. Paste the following into your new script file,
substituting your species name, and run it by selecting it and hitting
Ctrl+Enter or clicking “Run” in the top right corner of the script
pane:
require(dismo)
## Loading required package: dismo
## Loading required package: raster
## Loading required package: sp
gbif('Morpho','menelaus',geo = FALSE, download = FALSE) #find out the total number of occurrences for this species in the database -- if this doesn't match the number of occurrences you see on the website, you should see if you typed something wrong!
## [1] 1874
raw_data <- gbif('Morpho','menelaus',geo = TRUE) #download all the occurrences with longitude and latitude data, which may not be all of them
## 1874 records found
## 0-300-600-900-1200-1500-1800-1874 records downloaded
Inspect the data you downloaded:
df <- raw_data #copy the GBIF download file into another data frame so we can start cleaning it. We do this so that we are not modifying the original data we downloaded.
dim(df) #GBIF returns a LOT of columns!
## [1] 1874 176
Look at the column names:
colnames(df)
## [1] "acceptedNameUsage"
## [2] "acceptedScientificName"
## [3] "acceptedTaxonKey"
## [4] "accessRights"
## [5] "adm1"
## [6] "adm2"
## [7] "associatedReferences"
## [8] "associatedSequences"
## [9] "basisOfRecord"
## [10] "behavior"
## [11] "bibliographicCitation"
## [12] "catalogNumber"
## [13] "class"
## [14] "classKey"
## [15] "cloc"
## [16] "collectionCode"
## [17] "collectionID"
## [18] "collectionKey"
## [19] "continent"
## [20] "coordinatePrecision"
## [21] "coordinateUncertaintyInMeters"
## [22] "country"
## [23] "crawlId"
## [24] "dataGeneralizations"
## [25] "datasetID"
## [26] "datasetKey"
## [27] "datasetName"
## [28] "dateIdentified"
## [29] "day"
## [30] "depth"
## [31] "depthAccuracy"
## [32] "disposition"
## [33] "distanceFromCentroidInMeters"
## [34] "dynamicProperties"
## [35] "elevation"
## [36] "elevationAccuracy"
## [37] "endDayOfYear"
## [38] "establishmentMeans"
## [39] "eventDate"
## [40] "eventID"
## [41] "eventRemarks"
## [42] "eventTime"
## [43] "eventType"
## [44] "family"
## [45] "familyKey"
## [46] "fieldNotes"
## [47] "fieldNumber"
## [48] "footprintSRS"
## [49] "footprintWKT"
## [50] "fullCountry"
## [51] "gbifID"
## [52] "gbifRegion"
## [53] "genericName"
## [54] "genus"
## [55] "genusKey"
## [56] "geodeticDatum"
## [57] "georeferencedBy"
## [58] "georeferencedDate"
## [59] "georeferenceProtocol"
## [60] "georeferenceRemarks"
## [61] "georeferenceSources"
## [62] "georeferenceVerificationStatus"
## [63] "habitat"
## [64] "higherClassification"
## [65] "higherGeography"
## [66] "higherGeographyID"
## [67] "hostingOrganizationKey"
## [68] "http://unknown.org/captive_cultivated"
## [69] "http://unknown.org/language"
## [70] "http://unknown.org/modified"
## [71] "http://unknown.org/nick"
## [72] "http://unknown.org/orders"
## [73] "http://unknown.org/recordEnteredBy"
## [74] "http://unknown.org/recordID"
## [75] "identificationID"
## [76] "identificationReferences"
## [77] "identificationRemarks"
## [78] "identificationVerificationStatus"
## [79] "identifiedBy"
## [80] "identifier"
## [81] "individualCount"
## [82] "informationWithheld"
## [83] "infraspecificEpithet"
## [84] "installationKey"
## [85] "institutionCode"
## [86] "institutionID"
## [87] "institutionKey"
## [88] "isInCluster"
## [89] "ISO2"
## [90] "isSequenced"
## [91] "iucnRedListCategory"
## [92] "key"
## [93] "kingdom"
## [94] "kingdomKey"
## [95] "language"
## [96] "lastCrawled"
## [97] "lastInterpreted"
## [98] "lastParsed"
## [99] "lat"
## [100] "license"
## [101] "lifeStage"
## [102] "locality"
## [103] "locationID"
## [104] "locationRemarks"
## [105] "lon"
## [106] "materialEntityID"
## [107] "materialEntityRemarks"
## [108] "modified"
## [109] "month"
## [110] "municipality"
## [111] "nameAccordingTo"
## [112] "nomenclaturalCode"
## [113] "occurrenceID"
## [114] "occurrenceRemarks"
## [115] "occurrenceStatus"
## [116] "order"
## [117] "orderKey"
## [118] "organismID"
## [119] "organismQuantity"
## [120] "organismQuantityType"
## [121] "originalNameUsage"
## [122] "otherCatalogNumbers"
## [123] "ownerInstitutionCode"
## [124] "parentNameUsage"
## [125] "phylum"
## [126] "phylumKey"
## [127] "preparations"
## [128] "previousIdentifications"
## [129] "programmeAcronym"
## [130] "projectId"
## [131] "protocol"
## [132] "publishedByGbifRegion"
## [133] "publishingCountry"
## [134] "publishingOrgKey"
## [135] "recordedBy"
## [136] "recordNumber"
## [137] "references"
## [138] "reproductiveCondition"
## [139] "rights"
## [140] "rightsHolder"
## [141] "sampleSizeUnit"
## [142] "sampleSizeValue"
## [143] "samplingEffort"
## [144] "samplingProtocol"
## [145] "scientificName"
## [146] "scientificNameID"
## [147] "sex"
## [148] "species"
## [149] "speciesKey"
## [150] "specificEpithet"
## [151] "startDayOfYear"
## [152] "subfamily"
## [153] "superfamily"
## [154] "taxonConceptID"
## [155] "taxonID"
## [156] "taxonKey"
## [157] "taxonomicStatus"
## [158] "taxonRank"
## [159] "taxonRemarks"
## [160] "tribe"
## [161] "type"
## [162] "typeStatus"
## [163] "typifiedName"
## [164] "verbatimCoordinateSystem"
## [165] "verbatimElevation"
## [166] "verbatimEventDate"
## [167] "verbatimIdentification"
## [168] "verbatimLabel"
## [169] "verbatimLocality"
## [170] "verbatimSRS"
## [171] "verbatimTaxonRank"
## [172] "vernacularName"
## [173] "vitality"
## [174] "waterBody"
## [175] "year"
## [176] "downloadDate"
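With 176 columns, scanning the list by eye is slow. One possible approach is to search the names with grep(). The sketch below uses a short hand-typed vector called cols as a stand-in for colnames(df), so that it runs without the download:

```r
# A few of the column names printed above, typed in by hand for illustration
cols <- c("country", "lat", "locality", "lon", "verbatimLocality", "year")

grep("local", cols, ignore.case = TRUE, value = TRUE)   # names containing "local"
c("lat", "lon") %in% cols                               # are both coordinate columns present?
```

With the real download you would pass colnames(df) in place of cols.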
Look at some of the fields for the first six rows:
head(df)[,c("species","continent","country","adm1","lat","lon")]
## species continent country adm1 lat lon
## 1 Morpho menelaus SOUTH_AMERICA Brazil Rio de Janeiro -22.421437 -42.72357
## 2 Morpho menelaus NORTH_AMERICA Costa Rica Puntarenas 8.619720 -83.47618
## 3 Morpho menelaus SOUTH_AMERICA Ecuador Napo -0.946811 -77.86990
## 4 Morpho menelaus SOUTH_AMERICA Brazil Rondônia -9.879575 -62.83085
## 5 Morpho menelaus SOUTH_AMERICA Peru Madre de Dios -12.225983 -69.11453
## 6 Morpho menelaus SOUTH_AMERICA Brazil Espírito Santo -19.066258 -40.14829
Even though we specified geo=TRUE in our download, not
all the occurrences are associated with exact coordinates. You can see
this by examining the values in the ‘lat’ column - some are numbers,
some are “NA”.
df$lat
## [1] -22.421437 8.619720 -0.946811 -9.879575 -12.225983 -19.066258
## [7] -3.780987 -6.064480 -5.351419 -5.333967 -5.334458 -5.334078
## [13] -5.365805 -5.365753 -5.399935 -5.367108 -19.500424 -21.927737
## [19] 4.818926 -19.500252 4.559838 1.181890 4.845483 4.846287
## [25] 4.559448 4.614516 4.850600 4.852784 4.850377 4.816111
## [31] 4.850355 -19.894333 4.324474 -12.569283 4.846330 4.846330
## [37] 4.846492 4.747376 -20.344616 4.937918 4.711524 -9.303196
## [43] 3.611642 4.583960 8.644759 -13.809266 -19.384084 1.263482
## [49] -20.237344 -13.748128 -19.151444 -13.517420 1.191305 -19.891585
## [55] 8.621288 50.992780 -4.542322 9.811421 9.579149 9.600211
## [61] 9.600211 9.600211 9.264426 8.514605 -7.130298 5.520283
## [67] 4.951519 -20.122749 4.672428 -9.594988 -9.595508 4.861367
## [73] -23.487708 -23.300117 -20.256110 4.884912 -9.596210 -1.471529
## [79] -14.133985 -19.564458 -10.548402 3.862786 -22.482838 -6.131725
## [85] -21.860191 -0.492308 0.297361 -4.953959 -4.953959 -14.051504
## [91] -2.690120 -9.595911 -15.797515 -9.201110 -8.497624 -3.249010
## [97] -0.947220 9.380656 -0.644075 4.749904 -6.078563 4.747688
## [103] 9.379315 8.653168 8.621111 8.594728 8.311056 8.962379
## [109] 8.962379 9.128991 9.128991 9.117710 8.629521 8.640116
## [115] 8.640116 8.629521 1.078775 -22.586237 51.215340 -19.884822
## [121] 5.231796 0.430856 -12.436112 -5.956770 4.802023 4.802103
## [127] 4.948531 -7.352031 3.606272 4.636310 8.618011 8.618011
## [133] 9.128991 -15.564841 -22.584860 -6.062733 -1.196038 4.079598
## [139] 4.277997 4.277997 -19.007645 4.705225 -22.575515 3.285173
## [145] -2.552399 -5.981843 -7.105303 -2.651278 -23.763152 4.931193
## [151] 4.876801 -13.563002 -10.570198 5.124188 -1.292812 -7.118858
## [157] -7.115096 -16.225374 4.695541 1.102817 -19.977343 -8.934522
## [163] -6.641455 -7.177139 -3.998387 9.166311 9.165129 9.166311
## [169] 9.165129 -0.525955 -21.724566 -21.067681 -12.896993 4.940300
## [175] 9.372337 8.621080 8.637262 8.629564 9.120612 4.497143
## [181] -3.469904 4.171241 -19.154134 9.382699 9.382699 -3.118039
## [187] -6.069765 -6.176564 -23.014957 5.105623 -6.136080 -6.164994
## [193] 5.475397 4.647578 -1.744070 -1.651950 -1.744710 -1.725640
## [199] -1.663760 -1.668910 -1.666190 -19.150469 -1.915662 3.933889
## [205] -4.292860 -8.779898 -19.151077 8.448322 8.448843 9.174080
## [211] -9.247283 4.392002 9.128487 8.641485 -3.007096 -6.166503
## [217] -6.170613 4.899158 5.073611 -12.520339 -12.602535 -12.612850
## [223] -12.610083 -22.998989 -22.997564 2.985182 3.275450 -20.182722
## [229] -21.351531 -21.022766 -21.337994 4.637108 -19.020479 12.560491
## [235] -17.141545 -17.084532 0.030507 5.806028 -0.639606 -20.989517
## [241] -0.639565 4.583895 -1.451277 -1.198754 4.880159 4.602963
## [247] -1.031409 4.873762 -5.943541 -1.007590 -12.382073 -13.033549
## [253] 4.170289 -13.033549 8.152850 -10.963043 -27.195314 8.637262
## [259] 8.621080 -0.464842 -0.469946 4.889988 -0.477259 4.898391
## [265] -15.865398 -9.598495 -23.434530 -9.597494 3.892480 4.161371
## [271] -9.597581 5.450330 0.033860 4.867750 4.339475 4.344530
## [277] 5.496960 -0.520903 -4.240269 -1.104396 3.621452 8.478589
## [283] 9.134002 4.956705 -12.535989 -9.959420 -25.616750 -0.046257
## [289] -1.462441 -15.733222 -0.996406 -6.075373 -0.993889 -0.993612
## [295] -0.994270 -0.996315 -0.990805 -0.991284 -0.991295 -0.991257
## [301] 5.070786 5.316863 -5.994185 -2.009421 0.046925 0.046925
## [307] 0.046925 -2.801244 -2.817921 4.724493 4.724670 4.806228
## [313] -12.330908 -1.072632 -12.600330 -0.614629 -5.674653 51.211800
## [319] 51.211800 -3.438712 4.831168 -9.019088 4.637630 -5.365989
## [325] -5.365938 -5.365961 -5.333961 -5.366143 -5.366143 -5.366107
## [331] -5.366038 -5.364730 -5.364427 -5.364773 -5.371189 -5.366143
## [337] -5.366143 4.943200 -0.527644 0.133861 0.654024 0.048250
## [343] 0.048250 0.142194 1.123611 3.900480 3.900480 -6.832741
## [349] 0.138806 1.287806 1.285639 8.392663 -9.955906 -9.955156
## [355] -9.955663 -9.954633 -1.429993 6.372135 -22.436482 -22.436481
## [361] -15.873367 5.348903 5.348903 -14.096583 -12.568718 -4.161945
## [367] -9.597507 7.301117 7.350550 -9.597507 -0.674358 -1.429216
## [373] -11.854378 5.348903 5.367658 -2.812197 5.348903 -9.775909
## [379] -9.583139 -9.246550 5.348903 -15.735737 -15.629070 -9.417660
## [385] -19.329356 5.290370 -13.534361 -17.354317 -9.645191 -13.534361
## [391] -9.756327 -15.442202 -13.534361 -9.435092 -15.729922 -15.729922
## [397] -15.733955 -4.248655 -1.065085 -3.007447 -15.464492 -23.999964
## [403] 10.717817 8.664617 8.664617 8.664617 8.664617 10.418362
## [409] -19.890869 -22.603365 4.886644 -11.217304 -12.913517 4.857988
## [415] 4.857988 -9.978787 -9.597200 -0.616315 -20.459327 1.123611
## [421] 1.123611 1.123611 1.123611 1.123611 1.123611 1.123611
## [427] 1.123611 -12.225920 -13.519565 -5.987587 0.970882 4.284750
## [433] 5.462587 -1.115459 4.930000 4.245448 -12.337929 -19.981210
## [439] -0.676227 -16.665542 -15.865125 -22.586237 -22.586237 3.801169
## [445] 3.801287 5.757319 -3.271340 -3.249000 5.255068 -7.146626
## [451] 10.412242 -0.674358 -15.867350 -11.240289 -15.794639 -11.464624
## [457] 9.390973 -21.790704 7.119333 7.119333 7.131167 7.130556
## [463] 7.130556 7.131167 7.169639 3.753938 4.207365 3.960829
## [469] 9.679075 3.844282 3.596610 2.234540 4.961593 -9.597495
## [475] -8.477452 4.600000 4.628000 4.600000 4.600000 10.409167
## [481] -12.679130 -12.679656 -12.679656 -1.051367 -9.327545 -16.056372
## [487] 4.143682 -15.938923 -0.438419 9.119982 10.415557 10.419444
## [493] 10.417500 4.945597 4.550672 10.416556 -9.597507 -9.597507
## [499] -20.760286 -9.597601 -9.597601 -9.597601 5.321280 -10.877051
## [505] -0.638063 -0.638063 -0.434600 10.421667 10.978489 10.978489
## [511] NA 5.147153 10.419444 4.890723 -20.124211 9.154720
## [517] 10.409444 -16.527353 9.657446 4.559955 4.554902 10.408611
## [523] 1.259297 -1.046389 -12.615212 6.380780 -12.607466 3.282412
## [529] NA 10.421667 NA 4.089000 4.089000 10.408611
## [535] 4.828579 -2.448518 -8.041215 -13.540703 4.831660 10.408611
## [541] -2.541381 -2.541381 -2.541381 -9.597507 -9.597507 5.949730
## [547] 4.552620 4.552620 4.552620 4.552620 4.552620 4.552620
## [553] 10.417500 10.420456 10.416556 3.292210 -0.253300 -22.968072
## [559] -20.239484 -20.308725 NA NA NA NA
## [565] NA NA NA 9.924149 9.974529 4.038000
## [571] 4.098000 4.098000 4.098310 4.038000 4.565141 10.420833
## [577] 10.419444 10.409667 10.420556 4.552620 4.552620 4.552620
## [583] 4.552620 4.552620 4.552620 8.658924 8.658924 -12.957540
## [589] 8.689572 8.689572 -1.708100 -1.779583 9.203782 8.656437
## [595] NA 10.420556 9.925414 9.546111 9.546111 -22.966849
## [601] 9.925130 9.925414 9.924149 8.658924 9.928085 -1.759450
## [607] -1.726433 -1.702583 -9.597261 9.925130 -1.708100 -1.731850
## [613] 10.201667 -20.305196 9.154720 10.409444 10.410278 9.243181
## [619] NA 9.925130 -1.752233 10.416556 10.417164 4.098723
## [625] 10.416556 NA -1.733383 -1.756183 -1.718550 NA
## [631] -1.711083 -1.786183 -1.706167 -1.702583 -1.725550 -1.748517
## [637] -1.705950 -1.730033 -1.718550 -1.759450 -1.732633 -1.706167
## [643] -1.723733 -1.719567 -12.603419 9.154720 -1.703400 -1.752233
## [649] -1.725200 -1.723900 9.571849 -1.705950 4.602220 -1.706167
## [655] -1.730033 -1.725550 -1.734767 -1.723900 -1.718550 4.558889
## [661] 4.558889 10.417500 8.658924 10.417500 -1.703400 -1.723733
## [667] -1.723900 -1.737917 -1.706167 -1.708400 -1.705950 -1.706167
## [673] -1.706167 -1.782433 -1.725200 -19.153516 0.803056 -1.727433
## [679] -1.723900 -1.718550 8.536614 -1.719567 10.408611 10.410250
## [685] -1.706167 -4.005833 -1.734767 -1.748517 -1.756183 -1.756183
## [691] -1.777133 -1.752233 -1.730967 -1.755933 -1.748517 -1.775650
## [697] 8.560906 NA -1.706167 -1.737917 -1.786183 -1.780983
## [703] -1.703400 -1.727583 -1.728467 NA NA -1.727150
## [709] -1.722283 -1.705950 -1.759450 -1.721067 -1.711267 -1.734167
## [715] -1.780983 -1.706167 -1.702583 -6.069710 -6.069710 NA
## [721] -1.706167 -1.759450 -1.723900 -1.721067 -1.708400 -1.756183
## [727] -1.719567 -1.723733 5.707222 5.707222 5.707222 10.416556
## [733] 10.416556 -1.708400 -1.706167 -1.708100 -1.706167 -1.727433
## [739] -1.756183 -1.759450 8.995561 8.995561 -1.722283 -23.750000
## [745] -1.727433 -1.756183 -1.721067 -1.711267 -1.727150 -1.728467
## [751] 9.011023 9.011023 -1.727150 -1.784550 -1.708400 -1.703400
## [757] -1.703400 -1.723900 10.417500 -1.728467 -1.725550 10.416556
## [763] -23.750481 8.649823 8.649823 8.649823 NA NA
## [769] NA NA NA NA NA NA
## [775] 9.778326 NA NA NA NA NA
## [781] NA NA 8.405740 4.600000 -1.777133 10.400000
## [787] -5.066000 -5.066000 -3.800000 -3.800000 NA 10.409722
## [793] NA -17.351586 NA NA NA NA
## [799] NA NA NA NA NA NA
## [805] NA NA NA NA NA NA
## [811] 10.902333 10.902333 10.902333 NA NA NA
## [817] NA NA NA NA NA NA
## [823] NA NA NA NA 10.409722 10.409722
## [829] NA NA NA NA -23.433782 9.657730
## [835] 4.187754 NA -10.298600 NA 9.388318 4.600000
## [841] 4.600000 -4.585533 -4.585533 -26.300000 NA NA
## [847] 4.600000 -10.298600 NA 8.680656 NA 8.356504
## [853] NA NA 4.600000 4.600000 -14.556196 NA
## [859] NA NA NA 8.625444 -0.384444 NA
## [865] NA NA NA NA NA NA
## [871] NA NA 4.490934 4.490934 NA NA
## [877] 10.883267 NA 1.267222 NA NA NA
## [883] NA NA NA NA NA NA
## [889] NA NA NA NA NA NA
## [895] NA NA NA NA NA NA
## [901] NA NA NA NA NA NA
## [907] NA NA NA 9.675378 9.675378 -3.784611
## [913] -0.490695 6.630806 NA NA NA NA
## [919] 9.675378 NA NA NA NA NA
## [925] -10.340972 -10.340972 -9.906111 NA NA NA
## [931] NA NA NA 2.586896 2.586896 2.586896
## [937] NA NA 9.671765 NA NA NA
## [943] NA NA 8.480171 8.480171 -10.819120 -10.819120
## [949] 4.548917 4.548917 4.548917 NA -20.166667 NA
## [955] 5.661111 10.992609 10.992609 NA 10.539549 8.480171
## [961] 8.480171 8.480171 8.480171 4.660000 4.660000 4.660000
## [967] 4.660000 4.660000 4.660000 4.660000 4.660000 4.880000
## [973] 4.660000 4.660000 4.660000 10.992609 NA NA
## [979] NA -1.084083 -1.086528 NA NA NA
## [985] NA NA 8.640794 NA NA NA
## [991] NA NA NA NA NA NA
## [997] NA NA NA NA NA NA
## [1003] NA NA NA NA NA -1.902056
## [1009] NA NA NA NA NA NA
## [1015] 5.405556 5.405556 NA NA NA NA
## [1021] NA 4.579976 4.579976 -1.902056 NA NA
## [1027] NA NA NA NA NA NA
## [1033] NA NA NA NA NA NA
## [1039] NA NA NA NA NA NA
## [1045] NA NA NA NA NA NA
## [1051] NA NA NA NA NA NA
## [1057] NA NA NA NA NA NA
## [1063] NA NA NA NA NA NA
## [1069] NA NA NA NA NA NA
## [1075] NA NA NA NA NA NA
## [1081] NA NA NA NA NA NA
## [1087] NA NA NA NA NA NA
## [1093] NA NA NA NA NA NA
## [1099] NA NA NA NA NA 8.480171
## [1105] NA NA NA NA NA NA
## [1111] NA NA NA NA NA NA
## [1117] NA NA NA 5.661111 NA NA
## [1123] NA NA NA NA NA NA
## [1129] NA NA NA NA NA NA
## [1135] NA NA NA NA NA -10.000000
## [1141] -12.838000 NA NA NA NA NA
## [1147] -1.003821 -1.003821 NA NA NA NA
## [1153] NA NA NA NA NA NA
## [1159] NA NA NA NA NA NA
## [1165] 8.480171 8.480171 NA 8.479267 NA NA
## [1171] NA NA NA NA NA NA
## [1177] NA NA NA NA NA NA
## [1183] NA 8.480171 -8.169850 -10.000000 NA NA
## [1189] NA -10.000000 -3.686371 NA NA NA
## [1195] NA NA NA NA NA -9.296795
## [1201] -9.296795 NA NA NA NA NA
## [1207] NA NA NA NA 5.501527 NA
## [1213] -1.901770 -9.297680 NA NA NA NA
## [1219] 9.167817 NA NA 2.387500 NA 5.536706
## [1225] 5.536706 5.536706 5.536706 -9.296795 -9.296795 -9.296795
## [1231] -9.296795 NA NA NA -7.146177 -7.146177
## [1237] -7.146177 -7.146177 -7.146177 -7.146177 -7.146177 NA
## [1243] NA -9.296795 -9.296795 -9.296795 NA NA
## [1249] -3.789722 NA NA 4.200000 NA NA
## [1255] NA NA NA 5.449722 NA -17.783296
## [1261] -9.296795 -9.296795 -9.296795 -9.296795 5.319722 5.319722
## [1267] 5.416667 5.416667 4.200000 -10.417165 NA NA
## [1273] -1.908330 NA 9.156285 9.156285 9.156285 9.156285
## [1279] NA -1.908330 -1.908330 NA -1.904858 -1.904858
## [1285] -1.908330 4.993834 4.993834 -27.000000 NA -2.062350
## [1291] NA NA -22.209167 -22.209167 NA NA
## [1297] NA NA NA NA 1.267222 NA
## [1303] NA NA 5.633000 NA NA NA
## [1309] 5.787651 NA NA -10.000000 NA NA
## [1315] NA NA -2.091750 NA -16.277100 NA
## [1321] NA -1.901770 NA -1.901770 NA NA
## [1327] NA NA -9.379600 -16.183330 -15.023050 -16.183330
## [1333] -16.183330 -16.183330 NA -3.866700 NA NA
## [1339] NA NA -21.850471 -22.800466 NA NA
## [1345] NA NA NA NA NA NA
## [1351] NA NA -23.183777 NA NA NA
## [1357] NA NA NA NA NA NA
## [1363] NA NA NA NA NA NA
## [1369] NA NA NA NA -10.304670 -9.300000
## [1375] NA NA NA NA NA -5.000000
## [1381] NA NA NA NA NA NA
## [1387] NA NA NA NA NA NA
## [1393] NA NA NA NA NA NA
## [1399] NA NA NA NA NA NA
## [1405] NA NA NA NA NA NA
## [1411] NA NA NA NA NA NA
## [1417] NA NA NA NA NA NA
## [1423] NA NA NA NA NA NA
## [1429] NA NA NA NA NA NA
## [1435] NA NA -18.340000 -2.500000 NA NA
## [1441] 3.924040 NA NA NA NA NA
## [1447] NA NA NA NA NA NA
## [1453] NA NA NA NA NA NA
## [1459] NA NA NA NA NA NA
## [1465] NA NA NA NA NA NA
## [1471] NA NA NA NA NA NA
## [1477] NA NA NA NA NA NA
## [1483] NA NA NA NA NA NA
## [1489] NA NA NA NA NA NA
## [1495] NA NA -21.617168 -22.583766 NA -23.000476
## [1501] -23.000481 -22.250473 3.924040 NA NA NA
## [1507] 3.924040 NA 9.154720 9.154720 NA NA
## [1513] NA NA NA NA NA NA
## [1519] NA NA NA 6.000000 NA 3.914510
## [1525] 3.914510 3.924040 NA NA NA 3.857430
## [1531] 4.187754 NA NA NA NA NA
## [1537] NA NA NA NA NA NA
## [1543] NA NA NA NA NA NA
## [1549] NA NA NA NA NA NA
## [1555] NA NA NA NA NA NA
## [1561] NA NA NA NA NA NA
## [1567] NA NA NA NA NA NA
## [1573] NA NA NA NA NA NA
## [1579] NA NA NA NA NA NA
## [1585] NA NA 4.000000 4.000000 -1.887268 -10.000000
## [1591] -26.305575 NA -16.712000 -26.305575 5.209218 5.209218
## [1597] NA NA NA NA NA NA
## [1603] NA NA NA NA NA NA
## [1609] NA NA NA NA NA NA
## [1615] NA NA NA NA NA NA
## [1621] NA NA NA NA NA NA
## [1627] NA NA NA NA NA NA
## [1633] NA NA NA NA NA NA
## [1639] NA NA NA NA NA NA
## [1645] NA NA NA NA NA NA
## [1651] NA NA NA NA NA NA
## [1657] NA NA NA NA NA NA
## [1663] NA NA NA NA -4.240965 -4.240965
## [1669] NA NA NA NA NA NA
## [1675] NA NA NA NA NA NA
## [1681] NA NA NA NA NA NA
## [1687] NA NA NA NA NA NA
## [1693] NA NA NA NA NA NA
## [1699] NA NA NA NA NA NA
## [1705] NA NA NA NA NA NA
## [1711] NA NA NA NA NA NA
## [1717] 5.189780 NA -1.901770 -10.960990 -16.277100 -16.277100
## [1723] NA 3.878300 -1.901770 -12.499640 -16.277100 -3.368410
## [1729] NA -16.277100 -10.960990 -10.960990 14.978780 -2.293250
## [1735] -9.428000 -10.994240 -10.994240 -10.960990 -9.956720 -5.239390
## [1741] -4.000000 7.233330 -1.648620 -3.368410 -2.293250 -2.450630
## [1747] 5.189780 3.878300 -2.450630 3.878300 -22.690000 NA
## [1753] -1.592240 -21.975100 NA NA -21.975100 NA
## [1759] 4.633300 8.952480 -9.428000 -9.900000 -8.360000 NA
## [1765] NA NA NA NA NA NA
## [1771] NA NA NA NA NA NA
## [1777] NA NA NA NA NA NA
## [1783] NA NA NA NA NA -1.084083
## [1789] -4.267222 -3.422167 4.860417 4.860417 3.933889 6.804611
## [1795] -3.106417 -3.106417 NA -10.865611 -10.340972 -1.902056
## [1801] -9.906861 -10.299083 -4.267222 -11.494944 -11.505722 -10.340972
## [1807] NA -3.744667 -0.664917 NA -14.235000 -27.596917
## [1813] -11.505722 -26.435528 -11.505722 -2.454944 -3.784611 -11.808278
## [1819] -3.753361 -3.744667 -0.664917 -4.830333 -4.830333 -1.902056
## [1825] NA -1.902056 -1.902056 NA -2.018139 NA
## [1831] -1.998139 -1.902056 4.015028 NA NA NA
## [1837] NA NA NA NA NA NA
## [1843] NA NA NA NA NA NA
## [1849] NA NA NA NA NA NA
## [1855] NA NA NA NA NA NA
## [1861] NA -9.747699 NA NA NA NA
## [1867] NA NA NA NA NA NA
## [1873] -17.457149 5.501527
Plus, we don’t need to use all the data/columns that are provided by GBIF for our mapping purposes. So, a common step in any workflow is to make sure you have the cleanest dataset possible.
First, we’re going to create a new dataframe that just has a few of the columns, the ones most relevant to our project:
df <- df[,c("species","continent","country","adm1", "basisOfRecord", "lat","lon")]
Next, we’re going to remove all the occurrences that don’t have latitude and longitude data.
df <- subset(df,!is.na(df$lon) & !is.na(df$lat))
nrow(df) #how many data points do we have now?
## [1] 1040
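To see what the `!is.na()` filter is doing, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are invented for illustration and are not GBIF data. It also shows `complete.cases()`, a base R function that can do the same job.

```r
# A tiny made-up data frame; 'toy' is a hypothetical name, not part of the tutorial data
toy <- data.frame(species = c("a", "b", "c", "d"),
                  lat = c(-3.1, NA, 5.2, 10.0),
                  lon = c(-60.0, -55.3, NA, -84.1))

# Keep only the rows where both coordinates are present
kept <- subset(toy, !is.na(lon) & !is.na(lat))

# complete.cases() flags rows with no NA in the chosen columns
kept2 <- toy[complete.cases(toy[, c("lat", "lon")]), ]

nrow(kept)     # rows "a" and "d" survive, so 2
kept2$species  # the same two rows
```

Rows "b" and "c" are dropped because each is missing one coordinate.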
Then we transform all the negative longitude values so that the range goes from 0 to 360 instead of from -180 to 180. This will allow us to plot it on our map. We will add this as an extra column in “df” so that we can use either version.
westlongitudes <- which(df$lon < 0)
df[,"lon360"] <- df[,"lon"]
df[westlongitudes,"lon360"] <- 360 + df[westlongitudes,"lon"]
#Do you understand how these three lines work?
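If those three lines are unclear, the same steps can be traced on a short made-up vector. This is only a sketch: the values below are invented, and the last two lines (converting back to the -180 to 180 convention) are an addition for illustration.

```r
# Made-up longitudes in the usual -180..180 convention (not tutorial data)
lon <- c(-60, 4.5, -179, 0, 150)

west <- which(lon < 0)            # positions of the negative (western) longitudes: 1 and 3
lon360 <- lon                     # start from a copy of the original values
lon360[west] <- 360 + lon[west]   # shift only the western values
lon360                            # 300, 4.5, 181, 0, 150

# Going back the other way: values above 180 become negative again
back <- ifelse(lon360 > 180, lon360 - 360, lon360)
all.equal(back, lon)              # TRUE
```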
Next, we make a simple map to look for errors:
require(maps) #load the mapping library
## Loading required package: maps
map("world2",col = "darkgray") #generate the map
map.axes() #label the axes (longitude and latitude values)
points(df$lon360,df$lat,col = "red",pch = 20) #plot the species occurrence points
This will be easier to read if we make it so the map shows only the part of the Earth where GBIF has occurrence records for our species.
map("world2",col = "darkgray",
xlim = range(df$lon360,na.rm = T) + c(-1,1), #one extra degree on each side for visibility
ylim = range(df$lat,na.rm = T) + c(-1,1))
points(df$lon360,df$lat,col = "red",pch = 20)
map.axes()
In the example of the blue morpho butterflies, you can see that almost all the occurrences are from the tropical parts of South and Central America, but there are also a few others in Europe and Oceania. One possible reason is that GBIF does not only keep track of verified scientific occurrences: it also stores information on museum specimens and on community science observations made by the public. So let’s see what kinds of data points are included in your data set, and how many there are of each.
table(df[,"basisOfRecord"])
##
## HUMAN_OBSERVATION MATERIAL_SAMPLE OCCURRENCE PRESERVED_SPECIMEN
## 544 16 24 456
But we don’t want to include all of these samples in our range map – we’re trying to look at the actual habitat range of the living species. We should make sure we’re only dealing with live observations, not with fossil or preserved specimens. Which points are /not/ from observations of living animals?
notobs <- which(!(df$basisOfRecord == "HUMAN_OBSERVATION" | df$basisOfRecord == "OBSERVATION" | df$basisOfRecord == "OCCURRENCE"))
map("world2",col = "darkgray",
xlim = range(df$lon360,na.rm = T) + c(-1,1),
ylim = range(df$lat,na.rm = T) + c(-1,1))
points(df$lon360,df$lat,col = "red",pch = 20)
points(df[notobs,]$lon360,df[notobs,]$lat,col = "black",pch = 21)
map.axes()
The points that aren’t from actual observations of living butterflies are circled in black. We’ll remove these points:
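remove <- notobs
if (length(remove) > 0) df <- df[-remove,] #remove the incorrect points (skipped if nothing was flagged, because df[-integer(0),] would drop every row)
rm(remove)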
nrow(df) #how many left now?
## [1] 568
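The filtering step above can also be written with `%in%`, and it has one pitfall worth knowing about. The sketch below uses a made-up data frame (`toy` and its contents are invented for illustration). Note that `%in%` treats a missing `basisOfRecord` as "not in the set", which differs slightly from a chain of `==` comparisons.

```r
# Made-up records; 'toy' is hypothetical, not the GBIF download
toy <- data.frame(id = 1:4,
                  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                                    "OCCURRENCE", "MATERIAL_SAMPLE"))

# %in% tests each value against a set, replacing a chain of == and |
live <- c("HUMAN_OBSERVATION", "OBSERVATION", "OCCURRENCE")
notobs_toy <- which(!(toy$basisOfRecord %in% live))
notobs_toy                                 # rows 2 and 4

# Pitfall: if nothing is flagged, which() returns an empty vector,
# and a negative index built from an empty vector drops every row
empty <- which(toy$basisOfRecord == "FOSSIL_SPECIMEN")
nrow(toy[-empty, ])                        # 0 rows, not 4

# Keeping rows with a logical vector avoids the problem
nrow(toy[toy$basisOfRecord %in% live, ])   # 2
```

If your species has no preserved specimens at all, check `length(notobs)` before removing anything.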
Plot what’s left again to make sure nothing else stands out as probably incorrect:
map("world2",col = "darkgray",
xlim = range(df$lon360,na.rm = T) + c(-1,1),
ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20) #plot again with only the real data
Recall what you know about your species’ range. Do any of these occurrences look like they might be errors?
Data ‘cleaning’ is particularly important for data sourced from species distribution data warehouses such as GBIF. Such warehouses do not gather data specifically for the purpose of species distribution modeling, so you need to understand the data and clean them appropriately for your application.
My example species, the blue morpho butterfly Morpho menelaus, lives in South and Central American tropical rainforests. The points in Northern Europe seem pretty suspicious, on that basis; maybe they’re tagged incorrectly, and are actually captive individuals in a zoo, or even dead preserved specimens? Maybe someone incorrectly entered the latitude and longitude of the museum into the collection information? If you have data points in suspicious locations, take a look at them by filtering the latitude or longitude:
test1 <- which(df$lon360 < 250) #tagging all points that aren't in the Americas, by longitude
map("world2",col = "darkgray",
xlim = range(df$lon360,na.rm = T) + c(-1,1),
ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
points(df[test1,]$lon360,df[test1,]$lat,col = "black",pch = 21) #circle the flagged points in black
All the points we’ve identified as being in the wrong place are now circled in black. What can we find out about them?
df[test1,]
## species continent country adm1 basisOfRecord
## 56 Morpho menelaus EUROPE Belgium Flemish Brabant HUMAN_OBSERVATION
## 119 Morpho menelaus EUROPE Belgium Antwerp HUMAN_OBSERVATION
## 318 Morpho menelaus EUROPE Belgium Antwerp HUMAN_OBSERVATION
## 319 Morpho menelaus EUROPE Belgium Antwerp HUMAN_OBSERVATION
## lat lon lon360
## 56 50.99278 4.50166 4.50166
## 119 51.21534 4.42175 4.42175
## 318 51.21180 4.41615 4.41615
## 319 51.21180 4.41615 4.41615
These butterflies are in and around Antwerp, in Belgium (three records are tagged Antwerp and one is in neighboring Flemish Brabant). Antwerp has a very famous zoo, and when I search for information about it, it appears it has a butterfly garden! I suspect these are captive specimens, so I want to exclude them from my data set.
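remove <- c(test1)
if (length(remove) > 0) df <- df[-remove,] #remove the incorrect points (skipped if nothing was flagged, because df[-integer(0),] would drop every row)
rm(remove)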
What’s left?
map("world2",col = "darkgray",
xlim = range(df$lon360,na.rm = T) + c(-1,1),
ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
Those all look like reasonable places for blue morphos to live. Keep cleaning yours until you’ve gotten rid of any other data points that make no sense.
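For your own species, a single longitude cutoff may not be enough. One possible approach, sketched here on made-up points, is to flag everything outside a box defined by both longitude and latitude limits. The names `pts`, `lon_limits`, and `lat_limits` and all the numbers are assumptions for illustration; choose limits from what you know about your species.

```r
# Made-up occurrence points using the 0..360 longitude convention (not GBIF data)
pts <- data.frame(lon360 = c(300, 310, 4.4, 295, 150),
                  lat    = c(-3, -10, 51.2, 60, -20))

# Hypothetical expected range: a box given by longitude and latitude limits
lon_limits <- c(250, 330)
lat_limits <- c(-35, 25)

# A point is suspicious if it falls outside the box in either direction
outside <- which(pts$lon360 < lon_limits[1] | pts$lon360 > lon_limits[2] |
                 pts$lat < lat_limits[1] | pts$lat > lat_limits[2])
outside          # rows 3, 4 and 5 are flagged
pts[outside, ]   # inspect them before deciding to remove anything
```

Flagging is only the first step: look at the flagged records, as we did for the Belgian points, before deleting them.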
In a longer-term research project intended for publication, you would spend a lot more time on the data cleaning step, and indeed there are programs and functions for doing exactly that, but for today let’s leave it here.
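As one example of a further cleaning step (not part of today's workflow), repeated records can be found with base R's `duplicated()`. Records 318 and 319 in the table above, for instance, share identical coordinates. The sketch below uses a made-up data frame; whether duplicates should be dropped depends on your question, so treat this as an illustration only.

```r
# Made-up records (not GBIF data); rows 1 and 3 share the same species and coordinates
toy <- data.frame(species = c("Morpho menelaus", "Morpho menelaus", "Morpho menelaus"),
                  lon = c(-60.02, -55.30, -60.02),
                  lat = c(-3.10, -9.75, -3.10))

# duplicated() marks the second and later copies of a repeated row
dups <- duplicated(toy[, c("species", "lon", "lat")])
dups                     # FALSE FALSE TRUE

toy_unique <- toy[!dups, ]
nrow(toy_unique)         # 2
```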
Now, how should we visualize the species range? We’ll start by drawing a polygon that encloses all the points (this is called a “hull”).
require(sf); require(concaveman) #load mapping libraries
## Loading required package: sf
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
## Loading required package: concaveman
sfdata <- st_as_sf(df,coords = c("lon360","lat")) #this reformats the coordinate points into a special data structure
conc <- concaveman(sfdata,concavity = 3,length_threshold = 0) #this is called a concave hull, it's a polygon that contains all the points
conv <- convHull(df[,c("lon360","lat")]) #this is called a convex hull, it's just a polygon drawn around all the points that stick out the most
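To get a feel for what a convex hull is, here is a small sketch that uses base R's `chull()` instead of the packages above. The five points are made up: four corners of a square and one point in the middle.

```r
# Made-up points (not tutorial data): four corners of a square plus one point inside
x <- c(0, 4, 4, 0, 2)
y <- c(0, 0, 4, 4, 2)

# chull() (base R, grDevices) returns the positions of the points on the convex hull
hull <- chull(x, y)
sort(hull)               # 1 2 3 4: the interior point (5) is not on the hull

# Draw the points and the hull outline
plot(x, y, pch = 20, col = "red")
polygon(x[hull], y[hull], border = "black", col = rgb(1, 1, 0, 0.3))
```

Only the points that "stick out the most" define the hull; the point in the middle has no effect on its shape.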
Then make a map that shows the concave and convex hulls:
map("world2",col = "darkgray",
xlim = range(df$lon360,na.rm = T) + c(-1,1),
ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
plot(conv,add = T,col = rgb(1,1,0,0.3),lty = "blank")
plot(conc,add = T,col = rgb(1,0,0,0.3),lty = "blank")
legend("topright",col = c(rgb(1,1,0,0.3),rgb(1,0,0,0.3)),
legend = c("convex","concave"),pch = 15,bty = "n")
This isn’t very satisfactory as a map of species range, as it doesn’t consider whether your species could actually live in all the ‘potentially unoccupied’ places in between the points you plotted. In the next part we’ll look at some environmental data to see if we can figure out a better way.
To save your map to your class file, click Export>Save as Image. Give it a name that contains the species name and your name.
Download the climatic data from the WorldClim website.
require(geodata); require(raster);require(here)
## Loading required package: geodata
## Loading required package: terra
## terra 1.8.54
## Loading required package: here
## here() starts at /Users/jblois/Documents/GitHub/biodata_shortcourse/development
climate <- worldclim_global(var = 'bio',res = 2.5,path = here())
climate <- stack(climate)
The variable climate now contains a special data structure called a “RasterStack”, which consists of some number of matrices of exactly the same dimensions. (Think of it like a neatly aligned stack of maps.)
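The "stack of aligned maps" idea can be sketched in base R with a three-dimensional array. This is only an analogy with made-up numbers: a real RasterStack also stores geographic information such as the extent and resolution, which a plain array does not.

```r
# A base R sketch of the "stack of aligned maps" idea (made-up numbers, not WorldClim data)
# Two 2 x 3 grids over the same cells, stored as layers of one 3D array
temp   <- matrix(c(25, 26, 24, 27, 23, 22), nrow = 2)
precip <- matrix(c(200, 180, 220, 150, 300, 310), nrow = 2)
layers <- array(c(temp, precip), dim = c(2, 3, 2),
                dimnames = list(NULL, NULL, c("temp", "precip")))

dim(layers)          # 2 rows, 3 columns, 2 layers
layers[1, 2, ]       # every layer's value at the same cell: temp 24, precip 220
```

Because every layer has the same dimensions, one row and column position picks out the same place in all of them.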
names(climate) #these names are annoyingly long, let's rename them
## [1] "wc2.1_2.5m_bio_1" "wc2.1_2.5m_bio_2" "wc2.1_2.5m_bio_3"
## [4] "wc2.1_2.5m_bio_4" "wc2.1_2.5m_bio_5" "wc2.1_2.5m_bio_6"
## [7] "wc2.1_2.5m_bio_7" "wc2.1_2.5m_bio_8" "wc2.1_2.5m_bio_9"
## [10] "wc2.1_2.5m_bio_10" "wc2.1_2.5m_bio_11" "wc2.1_2.5m_bio_12"
## [13] "wc2.1_2.5m_bio_13" "wc2.1_2.5m_bio_14" "wc2.1_2.5m_bio_15"
## [16] "wc2.1_2.5m_bio_16" "wc2.1_2.5m_bio_17" "wc2.1_2.5m_bio_18"
## [19] "wc2.1_2.5m_bio_19"
names(climate) <- paste0("bio",1:19) #paste0() is vectorized, so this builds all 19 names at once
names(climate)
## [1] "bio1" "bio2" "bio3" "bio4" "bio5" "bio6" "bio7" "bio8" "bio9"
## [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
## [19] "bio19"
In the case of this climate data file that we just downloaded, those maps contain the values of 19 different climatic variables that are frequently relevant to species distributions, for all the land surface in the whole world (not the oceans).
You can plot any one of the layers to have a look at it. Call it by its name, using the $ operator, as an argument to the plot() function.
plot(climate$bio1)
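This layer, bio1, is the annual mean temperature. To see what each of the 19 bioclimatic variables means, look at https://www.worldclim.org/data/bioclim.html. In the WorldClim version 2.1 files downloaded here (note the “wc2.1” in the original layer names), temperature values such as bio1 are given in degrees Celsius (the older version 1.4 files used tenths of a degree); precipitation is in millimeters.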
Then you can plot your own species occurrence data on top of it, restricting the range of the map to the range of your occurrences plus 1 degree in each direction, the same way we did in Part 3. The climate data layers report longitude as going from -180 to 180, so we have to go back to the original longitude column (“lon”, not “lon360”):
plot(climate$bio1,
xlim = range(df$lon,na.rm = T) + c(-1,1),
ylim = range(df$lat,na.rm = T) + c(-1,1))
points(df$lon,df$lat,col = "red",pch = 20)
Overlay your species occurrences with each of the different bioclimatic data layers in climate (bio1 through bio19).
- Do any of the bioclimatic variables seem to be important in controlling the range of your species?
- If so, which ones? Save the images to your class folder for later reference.
- What do you think about this?
– Are you surprised by the results?
– Can you think of a reason why these particular climatic variables might have a lot to do with the possible range of your species?
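Rather than retyping the plotting commands 19 times, one option is to loop over the layer names. The sketch below assumes that `climate` is the renamed RasterStack and `df` is the cleaned data frame from the steps above, and that the raster package is loaded. The function name `overlay_all_layers` is hypothetical, not something provided by a package.

```r
# A sketch, assuming 'climate' (RasterStack) and 'df' (cleaned occurrences) exist as above
# and that the raster package is loaded. 'overlay_all_layers' is a hypothetical helper name.
overlay_all_layers <- function(climate, df) {
  for (layer in names(climate)) {
    # draw one climate layer, cropped to the occurrence range plus 1 degree
    plot(climate[[layer]], main = layer,
         xlim = range(df$lon, na.rm = TRUE) + c(-1, 1),
         ylim = range(df$lat, na.rm = TRUE) + c(-1, 1))
    # add the occurrence points, using the -180..180 longitude column
    points(df$lon, df$lat, col = "red", pch = 20)
  }
}

# Usage (run after the steps above): overlay_all_layers(climate, df)
```

In RStudio each map appears in the Plots tab, where you can page back through them with the arrow buttons.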
Tomorrow we’ll develop a quantitative model with these data to answer these questions!
Save your data so you can load it again tomorrow. This is not straightforward on UC Merced computer lab computers, so please follow ALL of the following steps:
Your instructor will make sure these files are here for you to load tomorrow morning.